Alex Brasch

Reference & Startup

The documentation and examples within this tutorial were gleaned from the following resources:

tidycensus
tigris
tidyverse
Census Developers
Census Geography Program
Leaflet for R

As noted by its author, Kyle Walker, “tidycensus is an R package that allows users to interface with the US Census Bureau’s decennial Census and American Community Survey (ACS) APIs and return tidyverse-ready data frames, optionally with simple feature (sf) geometry included.”

To get started, load the tidycensus and tidyverse packages. Additional packages used within this RMarkdown file include tigris, readxl, writexl, arcgisbinding, leaflet, kableExtra, janitor, and extrafont.
A Census API key is also required, which can be obtained from http://api.census.gov/data/key_signup.html. When the key is entered with the census_api_key function and install = TRUE, it is saved to the user’s .Renviron file, so entry only needs to occur once (i.e., the key is available to future R sessions rather than being tied to a single R script or Markdown file).

Install and load packages (uncomment as needed).
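
A minimal setup sketch, assuming the packages are installed from CRAN. The key string is a placeholder, and the package list in the commented install line is only the subset needed for the examples in this tutorial.

```r
# Install once if needed (uncomment):
# install.packages(c("tidycensus", "tidyverse", "tigris", "sf"))

library(tidycensus)
library(tidyverse)

# Store the Census API key. "YOUR_KEY_HERE" is a placeholder, not a real key.
# install = TRUE writes the key to ~/.Renviron so later sessions can find it.
# census_api_key("YOUR_KEY_HERE", install = TRUE)

# TRUE if a key is visible to the current session, FALSE otherwise.
key_is_set <- nzchar(Sys.getenv("CENSUS_API_KEY"))
key_is_set
```

If the last line returns FALSE after installing a key, restarting R (or running readRenviron("~/.Renviron")) usually makes the key visible.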

The following section contains examples of using the tidycensus package and Census API to retrieve, prepare, reshape, and blend demographic data for analysis and visualization. To save processing time and avoid local memory limitations, data-intensive code chunks have been commented out (e.g., retrieving large numbers of census blocks using the tidycensus or tigris packages). Readers can still view the underlying code, while the input data are read in from the R Project’s local data.

Review Variables

name     label                                  concept
H001001  Total                                  HOUSING UNITS
H002001  Total                                  URBAN AND RURAL
H002002  Total!!Urban                           URBAN AND RURAL
H002003  Total!!Urban!!Inside urbanized areas   URBAN AND RURAL
H002004  Total!!Urban!!Inside urban clusters    URBAN AND RURAL
H002005  Total!!Rural                           URBAN AND RURAL

name        label                                     concept
B00001_001  Estimate!!Total                           UNWEIGHTED SAMPLE COUNT OF THE POPULATION
B00002_001  Estimate!!Total                           UNWEIGHTED SAMPLE HOUSING UNITS
B01001_001  Estimate!!Total                           SEX BY AGE
B01001_002  Estimate!!Total!!Male                     SEX BY AGE
B01001_003  Estimate!!Total!!Male!!Under 5 years      SEX BY AGE
B01001_004  Estimate!!Total!!Male!!5 to 9 years       SEX BY AGE
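
Variable listings like the two tables above can be pulled with tidycensus's load_variables function. The sketch below assumes the first table comes from the 2010 decennial Summary File 1 and the second from the 2018 ACS 5-year data set; the source does not state which years were used.

```r
library(tidycensus)

# Variable lookup tables; cache = TRUE stores them locally for faster reuse.
# The year/dataset pairs are assumptions about what the tutorial used.
vars_dec_2010 <- load_variables(2010, "sf1", cache = TRUE)   # decennial SF1
vars_acs_2018 <- load_variables(2018, "acs5", cache = TRUE)  # ACS 5-year

head(vars_dec_2010)

# Search the concept column to find candidate variables,
# for example anything about median household income.
income_vars <- vars_acs_2018[
  grepl("median household income", vars_acs_2018$concept, ignore.case = TRUE),
]
head(income_vars)
```

Searching the label and concept columns this way is usually quicker than browsing the full listing, which contains thousands of rows.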

Usage

Example 1: Decennial Census

Retrieve the 2010 decennial census total population for all U.S. states.

  • Define a single, overarching location by default
  • Define a single variable manually
GEOID  NAME         P001001
01     Alabama      4779736
02     Alaska       710231
04     Arizona      6392017
05     Arkansas     2915918
06     California   37253956
22     Louisiana    4533372
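
A call along the following lines would produce a table shaped like the one above. It is a sketch rather than the author's original chunk; the object name state_pop is illustrative, and a Census API key is assumed to be installed.

```r
library(tidycensus)

# Total population (P001001) for every state from the 2010 decennial census.
# With no state or county argument, all states are returned.
state_pop <- get_decennial(
  geography = "state",
  variables = "P001001",
  year      = 2010,
  sumfile   = "sf1",
  output    = "wide"   # one column per variable, as in the table above
)

head(state_pop)
```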

Retrieve 2010 decennial census variables and geometries for a specific set of geographies.

  • Define multiple locations manually
  • Define and name multiple variables manually
  • Retrieve geometries
GEOID  NAME                        pop_tot  pop_sex_m_tot  pop_sex_f_tot
41005  Clackamas County, Oregon    375992   184925         191067
41051  Multnomah County, Oregon    735334   363645         371689
41067  Washington County, Oregon   529710   260524         269186
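
The three bullets above can be sketched as a single get_decennial call. The variable codes are an assumption: they are the 2010 SF1 "SEX BY AGE" total, male, and female counts, which may differ from the codes the author actually used. The object names are illustrative.

```r
library(tidycensus)

# Named vector: the names become column names in the result.
# The P012 codes are assumed, not confirmed by the source.
pop_vars <- c(
  pop_tot       = "P012001",
  pop_sex_m_tot = "P012002",
  pop_sex_f_tot = "P012026"
)

tri_county_pop <- get_decennial(
  geography = "county",
  state     = "OR",
  county    = c("Clackamas", "Multnomah", "Washington"),  # locations listed manually
  variables = pop_vars,
  year      = 2010,
  sumfile   = "sf1",
  geometry  = TRUE,    # adds an sf geometry column
  output    = "wide"
)

tri_county_pop
```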

Example 2: ACS

ACS data differ from decennial census data in that ACS data are based on an annual sample of households, rather than a complete enumeration. As a result, ACS data points are estimates characterized by a margin of error (MOE). tidycensus will always return the estimate and MOE for any requested variables. When requesting ACS data with tidycensus, it is not necessary to specify the “E” or “M” suffix for a variable name. Available survey types include ACS 5-year estimates (acs5) and ACS 1-year estimates (acs1). Note that the latter is only available for geographies with populations of 65,000 or greater.

Retrieve 2018 ACS 5-year variables and geometries for a specific set of geographies.

  • Define multiple locations manually
  • Define and name multiple variables manually
  • Retrieve geometries, specifically the TIGER/Line shapefiles

Concerning geometries, tidycensus uses the geographic coordinate system NAD 1983 (EPSG: 4269), which is the default for Census spatial data files. By default, tidycensus uses the Census cartographic boundary shapefiles for faster processing; if you prefer the TIGER/Line shapefiles (i.e., Topologically Integrated Geographic Encoding and Referencing), set cb = FALSE in the function call.

Per Census documentation, the cartographic boundary files are simplified representations of selected geographic areas from the Census Bureau’s Master Address File (MAF)/TIGER geographic database. These boundary files are specifically designed for small scale thematic mapping. When possible, generalization is performed with intent to maintain the hierarchical relationships among geographies and to maintain the alignment of geographies within a file set for a given year. To improve the appearance of shapes, areas are represented with fewer vertices than detailed TIGER/Line equivalents. Some small holes or discontiguous parts of areas are not included in generalized files. Generalized boundary files are clipped to a simplified version of the U.S. outline. As a result, some off-shore areas may be excluded from the generalized files. Consult the TIGER Data Products Guide to determine which file type is best for your purposes.

GEOID  NAME                        hh_medinc_totalE  hh_medinc_totalM  hh_foodst_totalE  hh_foodst_totalM  hh_foodst_recE  hh_foodst_recM
41005  Clackamas County, Oregon    76597             1050              155456            776               16694           859
41067  Washington County, Oregon   78010             1086              216507            1013              22213           1020
41051  Multnomah County, Oregon    64337             779               321968            1216              55018           1390
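
A wide ACS table with TIGER/Line geometries, like the one above, could be requested roughly as follows. The variable codes are assumptions (B19013_001 for median household income, B22001_001 and B22001_002 for total households and households receiving food stamps/SNAP); the source does not list the codes it used.

```r
library(tidycensus)

# Assumed variable codes; the names on the left become column prefixes.
acs_vars <- c(
  hh_medinc_total = "B19013_001",
  hh_foodst_total = "B22001_001",
  hh_foodst_rec   = "B22001_002"
)

tri_county_acs_wide <- get_acs(
  geography = "county",
  state     = "OR",
  county    = c("Clackamas", "Multnomah", "Washington"),
  variables = acs_vars,
  year      = 2018,
  survey    = "acs5",
  geometry  = TRUE,
  cb        = FALSE,   # passed on to tigris: TIGER/Line instead of cartographic boundaries
  output    = "wide"   # columns get "E" (estimate) and "M" (MOE) suffixes
)

# Inspect the coordinate reference system attached to the geometries.
sf::st_crs(tri_county_acs_wide)
```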

Concerning the structure of the data frame, a wide format contains a single row for each observation with many columns representing all variables (human-readable), while a long/tidy format contains many rows per observation (assuming more than one variable) with name-value pairs for each variable and associated value (machine-readable). For more details, see Hadley Wickham’s seminal paper Tidy Data.
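
To make the wide versus long distinction concrete, the sketch below reshapes a few rows with tidyr. The values are copied from the 2015 ACS 1-year table shown later in this section; the object names are illustrative. Note that tidyr builds wide column names as estimate_<variable> and moe_<variable>, whereas tidycensus's own wide output uses the E and M suffixes.

```r
library(tibble)
library(tidyr)

# Long/tidy rows: one row per county-variable pair.
acs_long <- tribble(
  ~GEOID,  ~NAME,                      ~variable,         ~estimate, ~moe,
  "41005", "Clackamas County, Oregon", "hh_medinc_total",  69629,    2898,
  "41005", "Clackamas County, Oregon", "hh_foodst_total", 152414,    1886,
  "41005", "Clackamas County, Oregon", "hh_foodst_rec",    17527,    1737,
  "41051", "Multnomah County, Oregon", "hh_medinc_total",  59231,    2100,
  "41051", "Multnomah County, Oregon", "hh_foodst_total", 311797,    3172,
  "41051", "Multnomah County, Oregon", "hh_foodst_rec",    61844,    2753
)

# Long -> wide: one row per county, one estimate and one moe column per variable.
acs_wide <- pivot_wider(
  acs_long,
  names_from  = variable,
  values_from = c(estimate, moe)
)

# Wide -> long again: split each column name back into value type and variable.
acs_back <- pivot_longer(
  acs_wide,
  cols          = -c(GEOID, NAME),
  names_to      = c(".value", "variable"),
  names_pattern = "^(estimate|moe)_(.*)$"
)

acs_wide
acs_back
```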

Retrieve 2015 ACS 1-year variables and geometries for a specific set of geographies.

  • Define multiple locations manually
  • Define and name multiple variables manually
  • Retrieve geometries, specifically the cartographic boundaries
  • Output in long/tidy format

Compare the structure of the data sets.

Wide format:

GEOID  NAME                        hh_medinc_totalE  hh_medinc_totalM  hh_foodst_totalE  hh_foodst_totalM  hh_foodst_recE  hh_foodst_recM
41005  Clackamas County, Oregon    76597             1050              155456            776               16694           859
41067  Washington County, Oregon   78010             1086              216507            1013              22213           1020
41051  Multnomah County, Oregon    64337             779               321968            1216              55018           1390

Long/tidy format:

GEOID  NAME                        variable          estimate  moe
41005  Clackamas County, Oregon    hh_medinc_total   69629     2898
41005  Clackamas County, Oregon    hh_foodst_total   152414    1886
41005  Clackamas County, Oregon    hh_foodst_rec     17527     1737
41051  Multnomah County, Oregon    hh_medinc_total   59231     2100
41051  Multnomah County, Oregon    hh_foodst_total   311797    3172
41051  Multnomah County, Oregon    hh_foodst_rec     61844     2753
41067  Washington County, Oregon   hh_medinc_total   70447     1703
41067  Washington County, Oregon   hh_foodst_total   211139    2446
41067  Washington County, Oregon   hh_foodst_rec     24253     2324
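
The long/tidy table above corresponds to tidycensus's default output. A sketch of the 2015 ACS 1-year request follows; as before, the variable codes are assumed rather than taken from the source.

```r
library(tidycensus)

# Assumed variable codes (median household income; food stamps/SNAP households).
acs_vars <- c(
  hh_medinc_total = "B19013_001",
  hh_foodst_total = "B22001_001",
  hh_foodst_rec   = "B22001_002"
)

tri_county_acs_long <- get_acs(
  geography = "county",
  state     = "OR",
  county    = c("Clackamas", "Multnomah", "Washington"),
  variables = acs_vars,
  year      = 2015,
  survey    = "acs1",   # 1-year estimates; only areas with 65,000+ people
  geometry  = TRUE      # cartographic boundaries are the default (cb = TRUE)
)                       # output defaults to "tidy": variable, estimate, moe

names(tri_county_acs_long)
```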

Retrieve 2018 ACS 5-year variables and geometries for all geographies within a larger geography (e.g., all counties within a state).

  • Define multiple locations with a vector
  • Define and name multiple variables manually
  • Retrieve geometries, specifically the cartographic boundaries
GEOID  NAME                     variable          estimate  moe
41001  Baker County, Oregon     hh_medinc_total   43921     2509
41001  Baker County, Oregon     hh_foodst_total   6927      271
41001  Baker County, Oregon     hh_foodst_rec     1280      168
41003  Benton County, Oregon    hh_medinc_total   58655     1968
41003  Benton County, Oregon    hh_foodst_total   35056     451
41003  Benton County, Oregon    hh_foodst_rec     4082      486
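
Omitting the county argument returns every county in the requested state. The sketch below keeps the state in a vector named states (an illustrative name) and reuses the assumed variable codes from the earlier ACS examples.

```r
library(tidycensus)

states <- c("OR")   # locations held in a vector

# Assumed variable codes (median household income; food stamps/SNAP households).
acs_vars <- c(
  hh_medinc_total = "B19013_001",
  hh_foodst_total = "B22001_001",
  hh_foodst_rec   = "B22001_002"
)

or_counties_acs <- get_acs(
  geography = "county",
  state     = states,   # no county argument: all counties in the state
  variables = acs_vars,
  year      = 2018,
  survey    = "acs5",
  geometry  = TRUE
)

head(or_counties_acs)
```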

Example 3: Tigris

In some cases, you may want to use the tabular and spatial data separately or may want to join the two data sets after analysis. In those cases, tidycensus can be used in combination with the tigris package.

Retrieve tract data for all counties within Oregon.

  • Use a vector of variables
  • Use a vector of locations
  • Use tidycensus to retrieve tabular data
  • Use tigris to retrieve spatial data
  • Join the tabular and spatial data

Retrieve shapes for all Oregon tracts via the tigris package.

Join the attributes to the geometries.
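
The three steps (tabular data, geometries, join) might look like the sketch below. It assumes a single illustrative variable, the 2018 vintage for both the attributes and the tract boundaries, and a tigris version that returns sf objects (the options call requests this on older versions).

```r
library(tidycensus)
library(tigris)
library(dplyr)

options(tigris_class = "sf")   # older tigris versions default to sp objects

tract_vars <- c(hh_medinc_total = "B19013_001")   # assumed variable code

# 1. Tabular data only, from tidycensus.
or_tract_attrs <- get_acs(
  geography = "tract",
  state     = "OR",
  variables = tract_vars,
  year      = 2018,
  survey    = "acs5",
  geometry  = FALSE,
  output    = "wide"
)

# 2. Geometries only, from tigris (cartographic boundary tracts).
or_tract_geoms <- tracts(state = "OR", cb = TRUE, year = 2018)

# 3. Join attributes (right) to geometries (left) on the shared GEOID key.
#    NAME is dropped from the attributes to avoid a duplicate column.
or_tracts <- left_join(
  or_tract_geoms,
  select(or_tract_attrs, -NAME),
  by = "GEOID"
)

class(or_tracts)
```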

Note that the resulting object’s class is dependent on the join order. The left side’s class takes priority: when the attributes (right side) are joined to the geometries (left side), the resulting object class is sf. If the order is flipped and the geometries (right side) are joined to the attributes (left side), the object class is not sf. To convert it, add %>% st_as_sf() to the end of the join.
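
The effect of join order can be checked without any downloads. In this sketch the point coordinates are hypothetical placeholders standing in for county polygons, and the population values are the 2010 totals from Example 1.

```r
library(sf)
library(dplyr)

# Placeholder geometries (hypothetical coordinates, NAD83 / EPSG:4269).
geoms <- st_sf(
  GEOID    = c("41005", "41051"),
  geometry = st_sfc(
    st_point(c(-122.2, 45.2)),
    st_point(c(-122.4, 45.5)),
    crs = 4269
  )
)

# Plain attribute table with the same key.
attrs <- tibble(GEOID = c("41005", "41051"), pop_tot = c(375992, 735334))

geoms_left <- left_join(geoms, attrs, by = "GEOID")   # geometries on the left
attrs_left <- left_join(attrs, geoms, by = "GEOID")   # attributes on the left

inherits(geoms_left, "sf")   # sf class is kept
inherits(attrs_left, "sf")   # sf class is lost; geometry is just a list column

# Restore the sf class after an attributes-first join.
attrs_left_sf <- st_as_sf(attrs_left)
inherits(attrs_left_sf, "sf")
```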

Example 4: Aggregation

At one point in 2019, retrieving all decennial census block group data for a specified state or county generated an error. This has since been resolved, but it provides a good example of how smaller nested geographies can be aggregated to larger geographies (e.g., blocks to block groups).

This works now…

But if it didn’t…

Retrieve block data, create the block group GEOID by removing the last 3 characters in the block GEOID, and group by/summarize to block group.

Aggregate data to block groups.
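
The roll-up itself is a string truncation followed by a grouped sum. The helper below, blocks_to_block_groups, is a hypothetical name introduced for illustration, and the toy input uses made-up counts (only the GEOID pattern is real), so the sketch runs without an API call.

```r
library(dplyr)

# Roll census blocks up to block groups.
# A block GEOID has 15 characters; dropping the last 3 leaves the
# 12-character block group GEOID.
blocks_to_block_groups <- function(blocks, value_col = "P001001") {
  blocks %>%
    mutate(GEOID = substr(GEOID, 1, nchar(GEOID) - 3)) %>%
    group_by(GEOID) %>%
    summarize(across(all_of(value_col), sum), .groups = "drop")
}

# Toy input: realistic GEOID structure, made-up population counts.
toy_blocks <- tibble(
  GEOID   = c("530330001001000", "530330001001001", "530330001002000"),
  P001001 = c(10, 15, 7)
)

toy_block_groups <- blocks_to_block_groups(toy_blocks)
toy_block_groups
```

With real data, the input would come from a block-level request such as get_decennial(geography = "block", state = "WA", county = "King", variables = "P001001", year = 2010, output = "wide"), and the result could be joined by GEOID to a direct block group request to compare the two totals.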

Compare the data sets.

GEOID         NAME                                                     P001001.x  P001001.y
530330001001  Block Group 1, Census Tract 1, King County, Washington   1250       1250
530330001002  Block Group 2, Census Tract 1, King County, Washington   1234       1234
530330001003  Block Group 3, Census Tract 1, King County, Washington   1337       1337
530330001004  Block Group 4, Census Tract 1, King County, Washington   1492       1492
530330001005  Block Group 5, Census Tract 1, King County, Washington   942        942
530330002001  Block Group 1, Census Tract 2, King County, Washington   1086       1086

Visualization

ggplot2 is a data visualization package that is part of the tidyverse.

Create a plot of 2010 decennial census state populations.
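
One possible version of this plot is a dot plot ordered by population. The chart type and styling are assumptions, since the original figure is not shown here.

```r
library(tidycensus)
library(ggplot2)

# Tidy output: columns GEOID, NAME, variable, value.
state_pop <- get_decennial(
  geography = "state",
  variables = "P001001",
  year      = 2010,
  sumfile   = "sf1"
)

# States ordered by population, largest at the top.
p_states <- ggplot(state_pop, aes(x = value, y = reorder(NAME, value))) +
  geom_point() +
  scale_x_continuous(labels = scales::comma) +
  labs(
    title = "2010 decennial census population by state",
    x     = "Total population",
    y     = NULL
  )

p_states
```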

Create a plot of 2014-2018 ACS 5-year estimates and MOEs for all Oregon counties.
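
Margins of error can be drawn as horizontal error bars around each estimate. The sketch assumes median household income (B19013_001) as the plotted variable, which the source does not specify.

```r
library(tidycensus)
library(dplyr)
library(ggplot2)

or_income <- get_acs(
  geography = "county",
  state     = "OR",
  variables = c(hh_medinc_total = "B19013_001"),  # assumed variable
  year      = 2018,
  survey    = "acs5"
) %>%
  # Shorten labels: "Baker County, Oregon" -> "Baker".
  mutate(NAME = gsub(" County, Oregon", "", NAME, fixed = TRUE))

p_income <- ggplot(or_income, aes(x = estimate, y = reorder(NAME, estimate))) +
  geom_errorbarh(aes(xmin = estimate - moe, xmax = estimate + moe)) +
  geom_point(color = "red", size = 2) +
  scale_x_continuous(labels = scales::dollar) +
  labs(
    title    = "Median household income by county, Oregon",
    subtitle = "2014-2018 ACS 5-year estimates; bars show the margin of error",
    x        = "ACS estimate",
    y        = NULL
  )

p_income
```

Counties with small populations tend to show wider bars, which is a useful visual reminder that their estimates are less precise.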

Create a choropleth map of a single variable across geographies within a single county.
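
A basic choropleth needs only a fill aesthetic and geom_sf. The county (Multnomah), geography (tracts), and variable (median household income) below are assumptions chosen for illustration.

```r
library(tidycensus)
library(ggplot2)

mult_income <- get_acs(
  geography = "tract",
  state     = "OR",
  county    = "Multnomah",
  variables = "B19013_001",   # assumed variable: median household income
  year      = 2018,
  survey    = "acs5",
  geometry  = TRUE
)

p_choro <- ggplot(mult_income, aes(fill = estimate)) +
  geom_sf(color = NA) +                            # no tract outlines
  scale_fill_viridis_c(labels = scales::dollar) +
  labs(fill = "Median household\nincome") +
  theme_void()

p_choro
```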

Create faceted choropleth maps of multiple variables across geographies within a single county.

As mentioned by Kyle Walker in his tidycensus tutorial, “one of the most powerful features of ggplot2 is its support for small multiples, which works very well with the tidy data format returned by tidycensus. Many Census and ACS variables return counts, which are generally inappropriate for choropleth mapping. In turn, get_decennial and get_acs have an optional argument, summary_var, that can work as a multi-group denominator when appropriate.” For example, view the racial/ethnic population distribution within a given county.
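
The summary_var pattern described in the quote can be sketched as follows. The race/ethnicity variable codes follow the example in the tidycensus documentation; applying them to Multnomah County, Oregon, is an assumption made here.

```r
library(tidycensus)
library(dplyr)
library(ggplot2)

race_vars <- c(
  White    = "P005003",
  Black    = "P005004",
  Asian    = "P005006",
  Hispanic = "P004003"
)

mult_race <- get_decennial(
  geography   = "tract",
  state       = "OR",
  county      = "Multnomah",   # assumed county
  variables   = race_vars,
  summary_var = "P001001",     # total population, returned as summary_value
  year        = 2010,
  sumfile     = "sf1",
  geometry    = TRUE
) %>%
  # Convert counts to percentages so tracts of different sizes are comparable.
  mutate(pct = 100 * value / summary_value)

p_facets <- ggplot(mult_race, aes(fill = pct)) +
  geom_sf(color = NA) +
  facet_wrap(~variable) +      # one small map per group
  scale_fill_viridis_c() +
  labs(fill = "% of tract\npopulation") +
  theme_void()

p_facets
```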

Create an unformatted, interactive map using mapview.
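
mapview gives a quick interactive view with almost no configuration. The data request below is an assumed example (median household income by tract in Multnomah County).

```r
library(tidycensus)
library(mapview)

mult_income <- get_acs(
  geography = "tract",
  state     = "OR",
  county    = "Multnomah",
  variables = "B19013_001",   # assumed variable
  year      = 2018,
  survey    = "acs5",
  geometry  = TRUE
)

# zcol selects the column used to color the polygons.
m_quick <- mapview(mult_income, zcol = "estimate", legend = TRUE)
m_quick
```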

Create a formatted, interactive map using leaflet.
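
With leaflet, the palette, popups, and legend are set explicitly. This sketch reuses the assumed Multnomah County income request and reprojects the data to WGS 84 (EPSG: 4326), which leaflet expects.

```r
library(tidycensus)
library(sf)
library(leaflet)

mult_income <- get_acs(
  geography = "tract",
  state     = "OR",
  county    = "Multnomah",
  variables = "B19013_001",   # assumed variable
  year      = 2018,
  survey    = "acs5",
  geometry  = TRUE
) %>%
  st_transform(4326)          # NAD83 -> WGS 84 for web mapping

# Continuous color scale over the range of estimates.
pal <- colorNumeric(palette = "viridis", domain = mult_income$estimate)

m_formatted <- leaflet(mult_income) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  addPolygons(
    fillColor   = ~pal(estimate),
    fillOpacity = 0.7,
    color       = "white",    # tract outline color
    weight      = 0.5,
    popup       = ~paste0(NAME, "<br>Estimate: ", estimate, "<br>MOE: ", moe)
  ) %>%
  addLegend(
    position = "bottomright",
    pal      = pal,
    values   = ~estimate,
    title    = "Median household income"
  )

m_formatted
```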

 
